Skip to content

feat: Add API compliance monitoring for conversation events#2204

Open
csmith49 wants to merge 18 commits intomainfrom
api-compliance-monitor
Open

feat: Add API compliance monitoring for conversation events#2204
csmith49 wants to merge 18 commits intomainfrom
api-compliance-monitor

Conversation

@csmith49
Copy link
Collaborator

@csmith49 csmith49 commented Feb 25, 2026

Summary

Implements an APIComplianceMonitor that watches events as they're added to conversations and detects API compliance violations. Violating events are logged and rejected (not added to the event log).

Problem

Events sometimes arrive out-of-order, duplicated, or missing, causing API compliance issues. The existing view properties system (ToolCallMatchingProperty, etc.) is a post-hoc enforcement mechanism that removes violating events after they've been added. This causes information loss and state inconsistency.

Solution

Add pre-emptive validation that detects violations as events are added. Events with violations are logged and rejected before being added to the event log.

New Components

  • ComplianceState: Shared state tracking pending/completed tool calls
  • ComplianceViolation: Represents detected violations with context
  • APIComplianceMonitor: Checks events and tracks tool-call state

Property Coverage

The monitor checks a single unified property that covers all 8 API compliance test patterns:

Patterns Description
a01, a03, a04, a07 Messages interleaved with pending tool calls
a02, a06, a08 Tool results referencing unknown IDs
a05 Duplicate results for same tool_call_id

Integration

  • Added add_event() to ConversationState with compliance checking
  • Updated LocalConversation callback to use add_event() instead of direct events.append()
  • Monitor is lazily initialized and stateful (tracks tool call state across events)
  • Events with violations are logged and rejected

Testing

  • 45 tests covering properties, monitor, and integration
  • All existing conversation tests pass

Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant Architectures Base Image Docs / Tags
java amd64, arm64 eclipse-temurin:17-jdk Link
python amd64, arm64 nikolaik/python-nodejs:python3.12-nodejs22 Link
golang amd64, arm64 golang:1.21-bookworm Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:1d57da7-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-1d57da7-python \
  ghcr.io/openhands/agent-server:1d57da7-python

All tags pushed for this build

ghcr.io/openhands/agent-server:1d57da7-golang-amd64
ghcr.io/openhands/agent-server:1d57da7-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:1d57da7-golang-arm64
ghcr.io/openhands/agent-server:1d57da7-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:1d57da7-java-amd64
ghcr.io/openhands/agent-server:1d57da7-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:1d57da7-java-arm64
ghcr.io/openhands/agent-server:1d57da7-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:1d57da7-python-amd64
ghcr.io/openhands/agent-server:1d57da7-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:1d57da7-python-arm64
ghcr.io/openhands/agent-server:1d57da7-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:1d57da7-golang
ghcr.io/openhands/agent-server:1d57da7-java
ghcr.io/openhands/agent-server:1d57da7-python

About Multi-Architecture Support

  • Each variant tag (e.g., 1d57da7-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 1d57da7-python-amd64) are also available if needed

Implements an APIComplianceMonitor that watches events as they're added to
conversations and detects API compliance violations. Currently operates in
observation mode (violations are logged but events continue processing).

New components:
- APICompliancePropertyBase: Base class for compliance properties
- ComplianceState: Shared state tracking pending/completed tool calls
- ComplianceViolation: Represents detected violations with context
- APIComplianceMonitor: Orchestrates checking across properties

Four properties covering all 8 API compliance patterns:
- InterleavedMessageProperty: a01, a03, a04, a07
- UnmatchedToolResultProperty: a02, a06
- DuplicateToolResultProperty: a05
- ToolResultOrderProperty: a08

Integration:
- Added add_event() to ConversationState with compliance checking
- Updated LocalConversation callback to use add_event()

The design supports future per-property reconciliation strategies.

Co-authored-by: openhands <openhands@all-hands.dev>
@github-actions
Copy link
Contributor

github-actions bot commented Feb 25, 2026

API breakage checks (Griffe)

Result: Failed

Log excerpt (first 1000 characters)

============================================================
Checking openhands-sdk (openhands.sdk)
============================================================
Comparing openhands-sdk 1.11.5 against 1.11.4
::notice title=openhands-sdk API::Ignoring Field metadata-only change (non-breaking): load_public_skills
::notice title=openhands-sdk API::Ignoring Field metadata-only change (non-breaking): temperature
::warning file=openhands-sdk/openhands/sdk/llm/llm.py,line=196,title=LLM.top_p::Attribute value was changed: `Field(default=1.0, ge=0, le=1)` -> `Field(default=None, ge=0, le=1, description='Nucleus sampling parameter. Defaults to None (uses provider default). Set to a value between 0 and 1 to control diversity of outputs.')`
::error title=SemVer::Breaking changes detected (1); require at least minor version bump from 1.11.x, but new is 1.11.5

============================================================
Checking openhands-workspace (openhands.workspace)
============================

Action log

@csmith49 csmith49 marked this pull request as ready for review February 25, 2026 21:00
Copy link
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taste Rating: 🟡 Acceptable - Solid design solving a real problem, but some architectural concerns around event addition safety.

Verdict: Core logic is sound. Address the dual-path event addition issue to prevent bypassed compliance checks. Data structure improvements are optional but recommended.

@csmith49
Copy link
Collaborator Author

@OpenHands take a look at the review comments and responses, implement the suggested changes

@openhands-ai
Copy link

openhands-ai bot commented Feb 25, 2026

I'm on it! csmith49 can track my progress at all-hands.dev

- Replace all_tool_call_ids field with computed property in ComplianceState
  to avoid maintaining redundant state (pending and completed already tracked)
- Simplify ToolResultOrderProperty.check() to use single membership check
  instead of three redundant checks
- Update InterleavedMessageProperty.update_state() to not maintain removed field
- Clarify add_event() docstring that it is the only supported way to add events
  and direct mutation of events list bypasses compliance monitoring

Co-authored-by: openhands <openhands@all-hands.dev>
@openhands-ai
Copy link

openhands-ai bot commented Feb 25, 2026

Summary

I've successfully addressed all the review feedback on PR #2204. The changes have already been pushed to the api-compliance-monitor branch (commit b9a79107).

Changes Made

  1. ComplianceState in base.py: Converted all_tool_call_ids from a redundant field to a computed @property that derives from set(pending_tool_call_ids) | completed_tool_call_ids

  2. ToolResultOrderProperty in properties.py: Simplified three redundant membership checks to a single event.tool_call_id not in state.all_tool_call_ids check

  3. InterleavedMessageProperty in properties.py: Removed the now-unnecessary state.all_tool_call_ids.add() call since the field no longer exists

  4. add_event() in state.py: Updated docstring to clearly document that this is the only supported way to add events and direct mutation bypasses compliance monitoring

Verification

  • ✅ All pre-commit checks pass (ruff, pyright, etc.)
  • ✅ All 33 compliance tests pass
  • ✅ Changes pushed to remote branch
  • ✅ All 4 review threads resolved with commit references

@csmith49
Copy link
Collaborator Author

@OpenHands please address the current round of unresolved comment

@openhands-ai
Copy link

openhands-ai bot commented Feb 25, 2026

I'm on it! csmith49 can track my progress at all-hands.dev

- Strengthen type from Event to LLMConvertibleEvent in compliance checks
  (base.py, monitor.py, properties) as we only check LLM-convertible events
- Remove _violations field from APIComplianceMonitor since violations are
  returned and logged per call (callers can accumulate if needed)
- Move compliance import to top of state.py instead of lazy import
- Refactor properties.py into a module with separate files for each property:
  - properties/interleaved_message.py
  - properties/unmatched_tool_result.py
  - properties/duplicate_tool_result.py
  - properties/tool_result_order.py
- Update add_event() to only check compliance for LLMConvertibleEvent instances
- Update tests to match new design (use returned violations and log assertions)

Co-authored-by: openhands <openhands@all-hands.dev>
@openhands-ai
Copy link

openhands-ai bot commented Feb 25, 2026

Summary

I've successfully addressed all 4 unresolved review comments on PR #2204. The changes have been pushed to the api-compliance-monitor branch (commit 28d034db) and all review threads have been resolved.

Changes Made

Review Comment Resolution
Strengthen type to LLMConvertibleEvent (base.py:75) Updated check() and update_state() signatures throughout the compliance module; add_event() now checks isinstance(event, LLMConvertibleEvent) before monitoring
Remove _violations field (monitor.py:43) Removed _violations, violations, violation_count, and clear_violations() - violations are returned per call and logged
Move import to top (state.py:528) Moved APIComplianceMonitor import to file top; verified no circular imports
Refactor into module (properties.py:1) Split into properties/ module with 4 separate files: interleaved_message.py, unmatched_tool_result.py, duplicate_tool_result.py, tool_result_order.py

Checklist

  • ✅ All 4 review comments addressed
  • ✅ Tests updated to match new design (32 tests pass)
  • ✅ Pre-commit checks pass
  • ✅ Changes pushed to remote
  • ✅ All review threads resolved with explanatory comments

Copy link
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Acceptable - Solid monitoring infrastructure with two fixable design issues

Linus-Style Assessment:

"Good Taste" Rating: 🟡 The core data structures are clean and the property interface is simple, but you're violating a fundamental principle: one concept, one check.

What's Good:

  • Clean data structure: ComplianceState tracks exactly what you need (pending + completed tool calls)
  • No deep nesting - all property checks are 2 levels max ✓
  • Pragmatic "observation mode" approach - gather data before implementing reconciliation
  • Solving a real problem (out-of-order/duplicate events in production)

Critical Issues to Fix:

  1. Missing Error Handling (🔴 Must Fix): If a property check throws, you break the "observation mode" contract and events won't be added. Defensive programming needed here.

  2. Duplicate Property Checks (🟠 Important): ToolResultOrderProperty and UnmatchedToolResultProperty perform identical checks (tool_call_id not in state.all_tool_call_ids) with different labels. The docstring claims semantic difference but the code cannot distinguish between "will arrive later" vs "never existed" - it's the same state check. This is poor taste. Either merge them or give them truly different conditions to check.

Verdict: Core logic is sound. Fix the error handling to guarantee observation mode works. Eliminate the duplicate check - don't track the same concept twice with different names.

See inline comments for details.

@csmith49
Copy link
Collaborator Author

@OpenHands Address the two error handling comments by implementing the suggestions.

@openhands-ai
Copy link

openhands-ai bot commented Feb 27, 2026

I'm on it! csmith49 can track my progress at all-hands.dev

@github-actions
Copy link
Contributor

github-actions bot commented Feb 27, 2026

Agent server REST API breakage checks (OpenAPI)

Result: Passed

Action log

- Wrap process_event() call in state.py with try-except to ensure events are
  always added regardless of compliance check failures
- Add try-except in monitor.py check_event() and update_state() to handle
  exceptions from individual properties gracefully
- Add test_monitor_handles_buggy_property_gracefully test to verify buggy
  properties don't crash the monitor

This ensures observation mode is robust: events are always processed even if
compliance checking fails.

Co-authored-by: openhands <openhands@all-hands.dev>
@openhands-ai
Copy link

openhands-ai bot commented Feb 27, 2026

Summary

I addressed the two error handling review comments on PR #2204. The changes have already been pushed to the api-compliance-monitor branch (commit ce3213b9).

Checklist

Requirement Status
✅ Add try-except in state.py around process_event() Done - ensures events always added regardless of compliance failures
✅ Add try-except in monitor.py for buggy property handling Done - in both check_event() and update_state()
✅ Add test for buggy property error handling Done - test_monitor_handles_buggy_property_gracefully()
✅ All tests pass 8/8 tests pass in test_monitor.py
✅ Pre-commit hooks pass Format, lint, pyright all pass
✅ Changes pushed Pushed to api-compliance-monitor branch
✅ Review threads resolved Both error handling threads resolved

Files Modified

  1. openhands-sdk/openhands/sdk/conversation/state.py - Added try-except around compliance monitor call
  2. openhands-sdk/openhands/sdk/conversation/compliance/monitor.py - Added try-except in check_event() and update_state()
  3. tests/sdk/conversation/compliance/test_monitor.py - Added new test for buggy property handling

All changes are concise and directly address the two error handling review comments. No extraneous changes were made.

openhands-agent and others added 4 commits February 27, 2026 20:48
Refactor the compliance monitoring system to use a simpler, more elegant
design based on the insight that all checks can be framed as 'what events
are allowed right now':

- If we have pending tool calls, only matching observations are allowed
- Messages cannot interleave with pending tool calls
- Tool results must reference known tool_call_ids

Changes:
- Remove separate property classes (InterleavedMessageProperty, etc.)
- Remove APICompliancePropertyBase abstract class
- Remove properties/ module entirely
- Add _check_tool_call_sequence method to APIComplianceMonitor
- Simplify ComplianceState (remove all_tool_call_ids - derived when needed)

This eliminates the duplicate check issue where ToolResultOrderProperty and
UnmatchedToolResultProperty performed identical checks with different labels.
One method now handles all 8 API compliance patterns (a01-a08).

Co-authored-by: openhands <openhands@all-hands.dev>
- Create tests/integration/api_compliance/patterns.py with all 8 compliance
  patterns (a01-a08) as CompliancePattern dataclass instances
- Add tests/sdk/conversation/compliance/test_pattern_detection.py with 17
  tests that verify the APIComplianceMonitor catches all patterns
- Refactor a01-a08 integration test files to use the shared patterns module

This enables:
1. Fast unit tests that verify monitor catches violations (no LLM calls)
2. Separate integration tests that verify LLM API responses
3. Single source of truth for pattern definitions

Co-authored-by: openhands <openhands@all-hands.dev>
Previously the API compliance monitor operated in observation mode,
logging violations but still adding events to the event log. Now events
with compliance violations are rejected (not added to the event log).

The warning is still logged, but the violating event is not added.

Co-authored-by: openhands <openhands@all-hands.dev>
Copy link
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taste Rating: 🟡 Acceptable - Solid monitoring infrastructure, but critical documentation/behavior mismatch needs fixing.

Eval Risk Flag: This PR changes event processing behavior (filters/rejects violating events). Needs human review + lightweight evals before approval.

Key Issue: Documentation claims "observation mode" but code actually rejects events with violations.

csmith49 and others added 2 commits March 3, 2026 09:22
Co-authored-by: OpenHands Bot <contact@all-hands.dev>
Co-authored-by: OpenHands Bot <contact@all-hands.dev>
@csmith49
Copy link
Collaborator Author

csmith49 commented Mar 3, 2026

@OpenHands Please read the review comment threads and address the documentation mismatches, as well as the type of the events property.

@openhands-ai
Copy link

openhands-ai bot commented Mar 3, 2026

I'm on it! csmith49 can track my progress at all-hands.dev

- Fix garbled/duplicate text in compliance/__init__.py docstring
- Change events property to return Sequence[Event] instead of EventLog
  to discourage direct mutation and enforce use of add_event()
- Add docstring to events property explaining the read-only intent

Co-authored-by: openhands <openhands@all-hands.dev>
@openhands-ai
Copy link

openhands-ai bot commented Mar 3, 2026

Final Summary

I've successfully addressed the review comments on PR #2204. The changes have been pushed to the api-compliance-monitor branch (commit d816d440).

Checklist

Fixed documentation mismatch in compliance/__init__.py: Removed garbled/duplicate text from the module docstring. It now clearly states that violations are "logged and rejected."

Changed events property type in state.py: Changed return type from EventLog to Sequence[Event] to discourage direct mutation, with a docstring directing users to add_event().

Resolved both unresolved review threads via GitHub GraphQL API

All 45 compliance tests pass

Pre-commit hooks pass (ruff lint, pyright)

Changes Made

  • openhands-sdk/openhands/sdk/conversation/compliance/__init__.py - Fixed docstring
  • openhands-sdk/openhands/sdk/conversation/state.py - Changed events property return type and added docstring

The changes are minimal and focused solely on addressing the review feedback.

@csmith49 csmith49 requested a review from all-hands-bot March 3, 2026 16:35
Copy link
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taste Rating: 🟡 Acceptable - Clean algorithm, but integration has critical issue: silent event loss could cause production debugging nightmares.

Eval Risk Flag: This PR changes event processing behavior (filters/rejects events). Per repo guidelines, needs human review + lightweight evals before approval.

Verdict: Core logic is sound, but silent event rejection needs addressing. See inline comments.

csmith49 and others added 3 commits March 3, 2026 09:40
…on.py

Co-authored-by: OpenHands Bot <contact@all-hands.dev>
…rn value

- Monitor now uses fail-closed semantics: if checking crashes, state is not
  updated (event is treated as violating)
- add_event() now returns bool indicating if event was actually added
- Added fail-closed handling in state.py: if compliance check crashes, reject event
- Updated tests to verify fail-closed behavior and return values

Co-authored-by: openhands <openhands@all-hands.dev>
…istBase

- Change ConversationState.events return type from Sequence[Event] to EventsListBase
- Add get_index abstract method to EventsListBase for event lookup by ID
- Implement get_index in RemoteEventsList class

This fixes pyright type errors where code was calling .append() and .get_index()
on state.events, which are not available on Sequence[Event].

Co-authored-by: openhands <openhands@all-hands.dev>
# Use add_event() to check API compliance before appending.
# Events with violations are logged and rejected (not added to event log).
# Violations are logged but events are still processed.
self._state.add_event(e)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think when we modify self._state we use a lock... so I wonder, does it mean that the compliance code is under it? It must be, right 🤔

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^ i think so, we hold the lock when default_callback is executed.

@OpenHands please review the code to confirm this. Do not change any code, just answer my question.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm on it! xingyaoww can track my progress at all-hands.dev

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

I was asked to review the code to confirm whether the API compliance monitoring code runs under a lock when self._state is modified.

Answer: Yes, the compliance code does run under the lock.

Here's the evidence from the code:

  1. ConversationState implements a FIFOLock via __enter__/__exit__ context manager methods (lines 506-513 in state.py).

  2. agent.step() is called inside with self._state: (lines 561, 619-621 in local_conversation.py), which acquires the lock.

  3. Inside agent.step(), the on_event callback is invoked, which eventually calls _default_callback.

  4. _default_callback calls self._state.add_event(e) (line 190), which runs the compliance check via self.compliance_monitor.process_event(event).

  5. The code explicitly documents this in the comment on lines 183-185:

    # This callback runs while holding the conversation state's lock
    # (see BaseConversation.compose_callbacks usage inside `with self._state:`
    # regions), so updating state here is thread-safe.

Conclusion: The compliance monitoring state modifications are thread-safe because _default_callback (which calls add_event()compliance_monitor.process_event()) runs while the FIFOLock on ConversationState is held.

No code changes were made as per the user's instructions.

) # FIFO lock for thread safety
_compliance_monitor: APIComplianceMonitor | None = PrivateAttr(
default=None
) # API compliance monitor (lazy-initialized)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder - are we doing any sort of serialization here? Because typically the base state is serialized and saved into base_state.json. What will happen if we do a serialization followed by deserialization here?

@enyst
Copy link
Collaborator

enyst commented Mar 4, 2026

@OpenHands Do a /codereview-roasted on this PR. Understand its goal first.

Publish your review here in the PR so we can see it.

@openhands-ai
Copy link

openhands-ai bot commented Mar 4, 2026

I'm on it! enyst can track my progress at all-hands.dev

Copy link
Collaborator

@enyst enyst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Taste rating: needs improvement

Goal (as I understand it)

Enforce LLM API-compliant tool-call sequencing at event-ingest time (instead of post-hoc filtering), so malformed sequences never enter the conversation event log.


[CRITICAL ISSUES] (must fix)

  1. Monitor state is not reconstructed on resume → valid events get rejected after restart
  • Where: openhands-sdk/openhands/sdk/conversation/state.py (resume path, ~lines 318–353), openhands-sdk/openhands/sdk/conversation/compliance/monitor.py (36–38)
  • Why it’s bad: ComplianceState is in-memory only. After resuming a persisted conversation, the next ConversationState.add_event() starts with an empty monitor state. If the conversation was persisted mid-tool-call (ActionEvent already recorded, ObservationEvent not yet recorded), the first ObservationEvent after restart will look “unmatched” and be rejected.
  • Concrete fix: On resume (or on first monitor creation), rebuild ComplianceState by scanning existing state.events and applying the same transitions (ActionEvent→pending, Observation→complete). Don’t “reject” historical violations during rebuild; you just need correct current pending/completed state.
  1. add_event() return value is ignored; state is mutated as if the event was stored
  • Where: openhands-sdk/openhands/sdk/conversation/impl/local_conversation.py 181–197
  • Why it’s bad: _default_callback calls self._state.add_event(e) and then unconditionally updates last_user_message_id. If the MessageEvent was rejected for compliance, you now have derived state pointing at an event that doesn’t exist in the log.
  • Concrete fix: Check the bool return. Only do follow-on state updates (like last_user_message_id) when the event was actually added.
  1. “Fail-closed” is documented, but the implementation is fail-open in the common case
  • Where:
    • openhands-sdk/openhands/sdk/conversation/state.py 556–569 says “reject event if compliance check crashes”
    • openhands-sdk/openhands/sdk/conversation/compliance/monitor.py 143–160 catches check exceptions and returns []
    • tests/sdk/conversation/compliance/test_monitor.py 134–159 asserts violations == [] when _check_tool_call_sequence throws
  • Why it’s bad: If the checker throws internally, process_event() swallows it and returns no violations → ConversationState.add_event() will accept and append the event. That’s neither fail-closed nor predictable.
  • Concrete fix: Pick one policy and make it consistent end-to-end:
    • Actually fail-closed: convert check exceptions into a violation (or re-raise so add_event() rejects), and test for rejection.
    • Fail-open (probably safer): accept the event but loudly log “monitor broken; compliance disabled” and stop pretending events were validated.
  1. Rejecting out-of-order events can permanently poison the conversation
  • Where: monitor.py unmatched observation path 82–109 + ConversationState.add_event() rejecting on violations 539–571
  • Why it’s bad: If an ObservationEvent arrives before its ActionEvent (the exact “out-of-order” scenario described in the PR), you reject the ObservationEvent. When the ActionEvent later arrives, it becomes “pending” forever because the result was thrown away.
  • Concrete fix: If you expect out-of-order delivery, you need buffering/reconciliation (or at least “store but don’t send to LLM”), not data loss.

[IMPROVEMENT OPPORTUNITIES] (should fix)

  • Unbounded growth of completed_tool_call_ids

    • Where: .../conversation/compliance/base.py 40–41, monitor.py 125–126
    • This set grows without bound with every tool call. Might be fine, but it should be an explicit tradeoff (or bounded).
  • Potentially breaking public typing surface

    • Where: ConversationState.events now returns EventsListBase (state.py 216–223)
    • If external users rely on EventLog-specific API (even just for typing), this is a breaking change that likely needs a deprecation/migration story.

[TESTING GAPS]

  • Add a persistence/resume test:

    1. Create state with persistence
    2. Add ActionEvent (pending)
    3. Re-create state via resume path
    4. Add matching ObservationEvent
    5. Assert it is not rejected
  • Add a test for LocalConversation default callback: rejected events must not update derived state like last_user_message_id.


VERDICT

Needs rework before I’d be comfortable turning “reject mode” on by default.

Key insight

The compliance state machine must be reconstructible from the durable data structure (the event log). If it can’t do that, it’s not robust — it’s just another source of inconsistency.

@openhands-ai

This comment was marked as duplicate.

Copy link
Collaborator

enyst commented Mar 4, 2026

Renewed review (second pass, fresh eyes)

I re-read my own review + the diff with a calmer head. Some points stay “must fix”, a couple should be downgraded to “design tradeoff / clarify intent”, and I also missed one real consistency hole in the callback pipeline.


Recalibration: what’s actually major?

Still CRITICAL (I wouldn’t ship reject-mode without these)

  1. Resume/restart correctness: monitor state not reconstructible from the event log
  • Where: openhands-sdk/openhands/sdk/conversation/state.py resume path (~318–353) + .../conversation/compliance/monitor.py
  • Why it’s still major: This is the classic “works in the happy path, breaks on restart” bug. A persisted conversation can be mid-tool-call when the process dies. After restart, a correct tool result can get rejected as “unknown tool_call_id” simply because the monitor forgot history.
  • Bottom line: If the compliance logic depends on prior events, it must be derivable from the durable structure (the event log). Otherwise it’s not monitoring; it’s roulette.
  1. ConversationState.add_event() returning False is currently too easy to ignore → state inconsistency + silent stalls
  • Where: .../conversation/impl/local_conversation.py default callback (181–197)
  • Why it’s still major: The default callback doesn’t check the bool and still mutates last_user_message_id. That’s an actual inconsistency: derived state can reference an event that wasn’t persisted.
  • Even worse: if events get rejected and the rest of the system proceeds as if they happened, you’re going to get “pending forever” tool calls or agents stuck waiting for observations.
  1. Exception policy is inconsistent (doc says fail-closed; behavior is effectively fail-open)
  • Where: .../conversation/compliance/monitor.py catches check exceptions and returns []; ConversationState.add_event() rejects only if process_event() raises.
  • Severity: Still major, but it might be “docs/tests are wrong” rather than “code is wrong”. Pick a policy and align:
    • If you want fail-open: document it as fail-open and consider disabling the monitor after repeated internal errors.
    • If you want fail-closed: treat a check exception as a violation (or raise) so add_event() can reject.

New CRITICAL issue I missed in the first pass

  1. Rejection happens after other callbacks run → side effects / UI can observe events that are not in the log
  • Where: callback composition in LocalConversation.__init__ and BaseConversation.compose_callbacks().
    • Visualizer callbacks are prepended (composed_list = [viz.on_event] + composed_list).
    • compose_callbacks() doesn’t support short-circuiting; it calls every callback unconditionally.
  • Why it matters: With “reject” semantics, you’ve introduced a split-brain:
    • downstream consumers (visualizer, user callbacks, hooks, etc.) can react to an event
    • but the canonical conversation history never contains it.

If the intent is “rejected events still exist as transient signals but are not persisted”, that needs to be explicit and tested. If the intent is “rejected events should not be observable anywhere”, the compliance check must happen before callback fan-out.


Items I’d downgrade (still worth addressing, but not blockers by themselves)

  • Out-of-order observation poisoning the conversation

    • I still think “rejecting a result that arrived early” can cause a permanent pending state. But on a second read: if this system assumes “Observation before Action” is always corruption (pattern a08), rejecting may be an acceptable hard stop.
    • Actionable ask: clarify whether the runtime ever expects out-of-order delivery. If yes → need buffering/reconciliation. If no → document that rejecting is intentional and that callers must handle False return.
  • Unbounded growth of completed_tool_call_ids

    • Real, but probably acceptable until proven otherwise. Still worth a comment in code (or a bound) because this is exactly the kind of “long-lived conversation eats memory” footgun.
  • Typing surface / “read-only events”

    • ConversationState.events is called “read-only” but returns EventsListBase, which includes append(). This is more a “doc doesn’t match interface” issue than a functional break.

Additional “might bite later” gap

  • Action events are always allowed (monitor.py line ~63).
    • For the current 8 patterns, that matches what you test.
    • But if you ever want to enforce “no new assistant message with tool calls before all results”, you’ll likely need to use llm_response_id to distinguish “parallel tool calls from one response” vs “a second assistant response while pending”. Right now the monitor can’t tell.

Updated verdict

Still 🔴 needs improvement. The core idea (pre-ingest checking) is fine, but “reject mode” isn’t robust until:

  • the monitor is reconstructible on resume,
  • callers can’t accidentally ignore rejections,
  • exception semantics are consistent,
  • and the callback pipeline can’t observe events that the log rejects (or you explicitly define that as intended).

@all-hands-bot
Copy link
Collaborator

[Automatic Post]: It has been a while since there was any activity on this PR. @csmith49, are you still working on it? If so, please go ahead, if not then please request review, close it, or request that someone else follow up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants